18 research outputs found

    General Purpose Computation on Graphics Processing Units Using OpenCL

    Computational science has emerged as a third pillar of science alongside theory and experiment, and parallel scientific computing is supported by a range of shared- and distributed-memory architectures such as supercomputers, grid and cluster systems, and multi-core and multiprocessor systems. In recent years, the use of GPUs (Graphics Processing Units) for general-purpose computing, commonly known as GPGPU, has become an attractive addition to high performance computing (HPC) systems in terms of price-performance ratio. Current GPUs consist of several hundred computing cores arranged in streaming multiprocessors, so the available degree of parallelism is high. Moreover, the development of new and easy-to-use interfacing tools and programming languages such as OpenCL and CUDA has made GPUs suitable for computationally demanding applications such as micromagnetic simulations. In micromagnetic simulations, studying magnetic behavior at very small time and space scales demands a huge computation time. The calculation of the magnetostatic field, which has complexity O(N log N) when the discrete convolution is computed with the FFT algorithm, is the main contribution to the total simulation time, and it is computed many times at each time step. The study and observation of magnetization behavior at sub-nanosecond time scales is crucial to a number of areas such as magnetic sensors, non-volatile storage devices and magnetic nanowires. Since micromagnetic codes can generally be divided into independent parts that run in parallel, they are well suited to parallel programming, and the current trend is to shift their computationally intensive parts to GPUs. My PhD work mainly focuses on the development of a highly parallel magnetostatic field solver for micromagnetic simulators on GPUs. 
I am using OpenCL for the GPU implementation because it is an open, cross-platform standard for parallel programming of heterogeneous systems. The magnetostatic field calculation is dominated by the computation of multidimensional FFTs (Fast Fourier Transforms). Therefore, I have developed a specialized OpenCL-based 3D-FFT library for magnetostatic field calculation, which makes it possible to fully exploit the zero-padded input data without transposition and the symmetries inherent in the field calculation. Moreover, it provides a common interface for GPUs from different vendors. In order to fully utilize the GPU's parallel architecture, the code needs to handle many hardware-specific technicalities such as coalesced memory access, data transfer overhead between GPU and CPU, GPU global memory utilization, arithmetic computation and batch execution. In a second step, to further increase the level of parallelism and performance, I have developed a parallel magnetostatic field solver for multiple GPUs. Using multiple GPUs sidesteps many of the limitations of a single GPU (e.g., on-chip memory resources) by exploiting the combined resources of multiple on-board GPUs. The GPU implementation has shown an impressive speedup against an equivalent OpenMP-based parallel implementation on CPU, which means that micromagnetic simulations that require weeks of computation on CPU can now be performed in hours or even minutes on GPUs. In parallel, I also worked on ordered queue management on GPUs. Ordered queue management is used in many applications including real-time systems, operating systems, and discrete event simulations. In most cases, the efficiency of an application itself depends on the sorting algorithm used for its priority queues. Lately, the use of graphics cards for general-purpose computing has renewed interest in sorting algorithms. 
In this work I have presented an analysis of different sorting algorithms with respect to sorting time, sorting rate and speedup on different GPU and CPU architectures, and provided a new sorting technique on GPUs.
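
The role of zero padding in the FFT-based convolution described above can be illustrated with a minimal sketch. It is a one-dimensional, sequential Python illustration written for this listing, not the OpenCL 3D-FFT library from the thesis; the function names `fft`, `fft_convolve` and `direct_convolve` are hypothetical. Both inputs are zero-padded to a power of two so that the circular convolution computed through the FFT equals the linear convolution.

```python
import cmath


def fft(x, inverse=False):
    """Recursive radix-2 Cooley-Tukey FFT; len(x) must be a power of two."""
    n = len(x)
    if n == 1:
        return list(x)
    even = fft(x[0::2], inverse)
    odd = fft(x[1::2], inverse)
    sign = 1 if inverse else -1
    out = [0j] * n
    for k in range(n // 2):
        # Twiddle factor applied to the odd-indexed half.
        t = cmath.exp(sign * 2j * cmath.pi * k / n) * odd[k]
        out[k] = even[k] + t
        out[k + n // 2] = even[k] - t
    return out


def fft_convolve(a, b):
    """Linear convolution of two real sequences in O(N log N) via zero padding."""
    out_len = len(a) + len(b) - 1
    size = 1
    while size < out_len:
        size *= 2
    # Zero padding avoids wrap-around from the circular convolution.
    fa = fft(list(a) + [0.0] * (size - len(a)))
    fb = fft(list(b) + [0.0] * (size - len(b)))
    product = [u * v for u, v in zip(fa, fb)]
    result = fft(product, inverse=True)
    # The unnormalized inverse transform needs a 1/size scaling.
    return [v.real / size for v in result[:out_len]]


def direct_convolve(a, b):
    """Reference O(N^2) convolution used to check the FFT result."""
    out = [0.0] * (len(a) + len(b) - 1)
    for i, u in enumerate(a):
        for j, v in enumerate(b):
            out[i + j] += u * v
    return out
```

Calling `fft_convolve([1.0, 2.0, 3.0], [0.5, -1.0])` should agree with `direct_convolve` on the same inputs up to floating-point rounding. In a GPU setting, a large share of the padded input is known to be zero, which is presumably the structure a specialized library can exploit.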

    Analysis of Fast Parallel Sorting Algorithms for GPU Architectures

    Sorting algorithms have been studied extensively over the past three decades. They are used in many applications including real-time systems, operating systems, and discrete event simulations. In most cases, the efficiency of an application itself depends on the sorting algorithm it uses. Lately, the use of graphics cards for general-purpose computing has renewed interest in sorting algorithms. In this paper we extend our previous work on parallel sorting algorithms on GPUs and present an analysis of parallel and sequential bitonic, odd-even and rank-sort algorithms on different GPU and CPU architectures. Their performance for various queue sizes is measured with respect to sorting time and rate, and the speed-up of bitonic sort over the odd-even sorting algorithm is shown on different GPUs and a CPU. The algorithms have been written to exploit the task parallelism model available on multi-core GPUs using the OpenCL specification. Our findings report a minimum 19x speed-up of bitonic sort over the odd-even sorting technique for small queue sizes on CPU and a maximum 2300x speed-up for very large queue sizes on the Nvidia Quadro 6000 GPU architecture.
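
One way to see why bitonic sort can outpace odd-even sorting on parallel hardware is to count the compare-exchange stages each network needs. The sketch below is a sequential Python illustration, not the OpenCL code from the paper, and it assumes that "odd-even" refers to odd-even transposition sort; the function names are hypothetical. Within one stage all compare-exchange operations are independent, so each stage could in principle run as one parallel step.

```python
def bitonic_sort(data):
    """Bitonic sorting network; returns (sorted list, number of stages).

    The input length must be a power of two.
    """
    a = list(data)
    n = len(a)
    if n == 0 or n & (n - 1):
        raise ValueError("length must be a power of two")
    stages = 0
    k = 2
    while k <= n:                 # size of the bitonic sequences being merged
        j = k // 2
        while j >= 1:             # distance between compared partners
            for i in range(n):    # independent work-items within a stage
                partner = i ^ j
                if partner > i:
                    ascending = (i & k) == 0
                    if (a[i] > a[partner]) == ascending:
                        a[i], a[partner] = a[partner], a[i]
            stages += 1
            j //= 2
        k *= 2
    return a, stages


def odd_even_transposition_sort(data):
    """Odd-even transposition sort; returns (sorted list, number of phases)."""
    a = list(data)
    n = len(a)
    for phase in range(n):        # n phases are enough to sort any input
        start = phase % 2         # alternate between even and odd pairs
        for i in range(start, n - 1, 2):
            if a[i] > a[i + 1]:
                a[i], a[i + 1] = a[i + 1], a[i]
    return a, n
```

For 1024 elements the bitonic network uses 55 stages, while odd-even transposition uses 1024 phases. The gap widens as the input grows, which is consistent in direction with the larger speed-ups reported for large queue sizes, although the measured figures also depend on hardware details that this sketch does not model.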

    Deep Learning-Based Multi-Modal Ensemble Classification Approach for Human Breast Cancer Prognosis

    Ensemble models based on deep learning have made significant contributions to the medical field, particularly in the area of disease prediction. Breast cancer is a highly aggressive disease with a high mortality rate. Timely and effective prediction of breast cancer can reduce the risk of it progressing to later stages and the need for unnecessary medications. While previous studies have focused on predicting breast cancer using single-modal datasets, multi-modal datasets that include gene expression (gene exp), clinical, and copy number variation (CNV) data have become available in recent years for predictive model development. However, despite multiple studies using multi-modal data for disease prediction, models designed for breast cancer are typically homogeneous neural networks. This article proposes a heterogeneous deep learning-based ensemble model for effective breast cancer prediction using multi-modal data. The model consists of three phases: feature extraction, stacked feature set creation, and prediction with a stacking-based model that feeds the extracted features to a random forest algorithm. For feature extraction, convolutional neural networks (CNNs) are used for clinical and gene expression data, and deep neural networks (DNNs) are used for CNV data. The extracted features from the CNNs and DNNs are stacked to create a comprehensive feature set. The simulation results demonstrate the superiority of the proposed framework in terms of accuracy compared to uni-modal frameworks and homogeneous multi-modal frameworks.

    Fast parallel sorting algorithms on GPUs

    This paper presents a comparative analysis of three widely used parallel sorting algorithms, odd-even sort, rank sort and bitonic sort, in terms of sorting rate, sorting time and speed-up on a CPU and different GPU architectures. We have also implemented a novel parallel algorithm, the min-max butterfly network, for finding the minimum and maximum in large data sets. All algorithms have been implemented using the data parallelism model available on multi-core GPUs through the OpenCL specification in order to achieve high performance. Our results show a minimum speed-up of 19x for bitonic sort over the odd-even sorting technique for small queue sizes on CPU and a maximum speed-up of 2300x for very large queue sizes on the Nvidia Quadro 6000 GPU architecture. Our implementation of full-butterfly network sorting performs relatively better than all three sorting techniques: bitonic, odd-even and rank sort. For the min-max butterfly network, our findings report a high speed-up on the Nvidia Quadro 6000 GPU for large data sets of up to 2^24 elements, with much lower sorting time.
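
Rank sort is the simplest of the three algorithms to map onto a data-parallel model, because the final position of every element can be computed independently. The following is a minimal sequential Python sketch of the general technique, not the paper's OpenCL kernel; the function name `rank_sort` is hypothetical. Each iteration of the outer loop corresponds to what one GPU work-item could compute on its own.

```python
def rank_sort(data):
    """Sort by computing, for every element, how many elements precede it."""
    n = len(data)
    out = [None] * n
    for i, x in enumerate(data):      # one independent work-item per element
        rank = 0
        for j, y in enumerate(data):
            # Ties are broken by original index so that ranks stay unique.
            if y < x or (y == x and j < i):
                rank += 1
        out[rank] = x                 # each element writes to a distinct slot
    return out
```

The total work is O(N^2) comparisons, so the approach relies on having many parallel work-items to be competitive; with one work-item per element, each performs N comparisons.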

    Parallel butterfly sorting algorithm on GPU

    Efficient sorting is vital for the overall performance of the underlying application. This paper presents Butterfly Network Sort (BNS) for sorting large data sets. A minimal version of the algorithm, Min-Max Butterfly, is also shown for finding the minimum and maximum values in data. Both algorithms are implemented on GPUs using OpenCL and exploit the data parallelism model. Results obtained on different GPU architectures show better performance of butterfly sorting in terms of sorting time and rate. The comparison of butterfly sorting with other algorithms (bitonic, odd-even and rank sort) shows significant speedup improvements over all of them on the Nvidia Quadro 6000 GPU, with relatively better sorting time and rate.
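
The abstract does not spell out the Min-Max Butterfly algorithm, so the following is only a sketch of how a butterfly-shaped compare-exchange pattern can locate the minimum and maximum, under the assumption that partners are paired at strides N/2, N/4, ..., 1. The function name `min_max_butterfly` is hypothetical and the code is sequential Python rather than OpenCL. After each stage the smaller value of every pair sits in the lower position, so the global minimum migrates to index 0 and the global maximum to index N-1.

```python
def min_max_butterfly(data):
    """Return (minimum, maximum) using butterfly-style compare-exchange stages.

    The input length must be a power of two. This is an illustrative
    assumption about the access pattern, not the published algorithm.
    """
    a = list(data)
    n = len(a)
    if n == 0 or n & (n - 1):
        raise ValueError("length must be a power of two")
    stride = n // 2
    while stride >= 1:
        for i in range(n):            # independent work-items within a stage
            if i & stride == 0:       # i is the lower partner of its pair
                j = i | stride
                if a[i] > a[j]:
                    a[i], a[j] = a[j], a[i]
        stride //= 2
    return a[0], a[n - 1]
```

This variant needs log2(N) stages, each with N/2 independent compare-exchange operations. A tuned version would presumably restrict later stages to the sub-blocks that can still contain the minimum or maximum.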

    Ensemble Framework of Deep CNNs for Diabetic Retinopathy Detection

    Diabetic retinopathy (DR) is an eye disease that damages the blood vessels of the eye. DR causes blurred vision and may lead to blindness if it is not detected in its early stages. DR has five stages: 0 normal, 1 mild, 2 moderate, 3 severe, and 4 proliferative DR (PDR). Conventionally, many hand-engineered computer vision approaches have been applied to detect DR, but they cannot encode the intricate underlying features. Therefore, they result in poor classification of DR stages, particularly the early stages. In this research, two deep CNN models were proposed with an ensemble technique to detect all the stages of DR using balanced and imbalanced datasets. The models were trained on the Kaggle dataset using a high-end graphics processing unit (GPU). A balanced dataset was used to train both models, and the models were tested with balanced and imbalanced datasets. The results show that the proposed models detect all the stages of DR, unlike current methods, and perform better than state-of-the-art methods on the same Kaggle dataset.